Abstract 

We suggest using the max-norm as a convex surrogate constraint for clustering. We show how this 
yields a better exact cluster recovery guarantee than the previously suggested nuclear-norm relaxation, and 
study the effectiveness of our method, and other related convex relaxations, compared to other clustering 
approaches. 



1 Introduction 

Clustering, the problem of partitioning data into clusters with strong similarity inside the clusters and 
strong dissimilarity across different clusters, is one of the main problems in machine learning. In this paper, 
we consider the problem of cut-based, or correlation, clustering [4] that has received a lot of attention 
recently [TJE2JG]: Given a graph $G(V, E)$ on $n$ nodes with a normalized symmetric affinity matrix $A$ (for all $u, v \in V$: 
$0 \le A_{uv} \le 1$ and $A_{uu} = 1$), we want to partition $V$ into clusters $\mathcal{C} = \{C_1, \ldots, C_k\}$ so as to minimize the total 
disagreement 

$$D(\mathcal{C}) = \sum_{i=1}^{k} \sum_{u,v \in C_i} (1 - A_{uv}) + \sum_{\substack{i,j=1 \\ i \neq j}}^{k} \sum_{u \in C_i,\, v \in C_j} A_{uv}.$$

The first term captures the internal disagreement inside clusters, and the second term captures the external 
agreement between nodes in different clusters. In an ideal clustering, the affinities between all members of 
the same cluster are 1 and the affinities between members of two different clusters are zero, and hence the 
objective is zero. This objective does not require the number of clusters to be known ahead of time: we 
may decide to use any number of clusters, and this is accounted for in the objective. Unfortunately, finding 
a clustering minimizing the disagreement $D(\mathcal{C})$ is NP-Hard [3]. 
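As a small illustration of the disagreement objective, the following sketch evaluates $D(\mathcal{C})$ for a clustering given as one label per node. The function name `disagreement` and the list-of-lists matrix layout are our own choices for the example; ordered pairs $(u, v)$ are summed, including $u = v$, which contributes zero because $A_{uu} = 1$.

```python
def disagreement(A, labels):
    """Total disagreement D(C) of a clustering given as one label per node.

    A is a symmetric affinity matrix (list of lists) with entries in [0, 1].
    Pairs inside a cluster contribute 1 - A[u][v]; pairs in different
    clusters contribute A[u][v]. Ordered pairs are summed.
    """
    n = len(A)
    total = 0.0
    for u in range(n):
        for v in range(n):
            if labels[u] == labels[v]:
                total += 1.0 - A[u][v]   # internal disagreement
            else:
                total += A[u][v]         # external agreement
    return total


# Two ideal clusters {0, 1} and {2, 3}: affinities are 1 inside, 0 across.
A_ideal = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(disagreement(A_ideal, [0, 0, 1, 1]))  # 0.0, the ideal clustering
print(disagreement(A_ideal, [0, 0, 0, 0]))  # 8.0, everything merged
```

Note that the number of clusters is not an input: any labeling can be scored, which is how the objective accounts for the choice of $k$.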

We formulate this problem as an optimization of a convex disagreement objective over a non-convex set 
of valid clustering matrices (Section 2) and then consider convex relaxations of this constraint. Recently, 
Jalali et al. [16] suggested a trace-norm (aka nuclear-norm) relaxation, casting the problem as minimizing 
an $\ell_1$ loss and a trace-norm penalty, and providing conditions under which the true underlying clustering is 
recovered. Instead of the trace-norm, we propose using the max-norm (aka the $\gamma_2 : \ell_1 \to \ell_\infty$ norm) [30], which is a 
tighter convex relaxation than the trace-norm. Accordingly, we establish an exact recovery guarantee for our 
max-norm based formulation that is strictly better than the trace-norm based guarantee. We show that if the 
affinity matrix is a corruption of an "ideal" clustering matrix, with a certain bound on the corruption, then 
the optimal solution of the max-norm bounded optimization problem is exactly the ideal clustering (Section 
3.1). We also discuss even tighter convex relaxations related to the max-norm, and suggest augmenting 
the convex relaxation with a single-linkage post-processing step in case of non-exact recovery, showing the 
empirical advantages of these approaches (Section 5). 

The approach we suggest relies on optimizing an $\ell_1$ objective subject to a max-norm constraint. A 
similar optimization problem with a trace-norm constraint (or trace-norm regularization) has recently been 
the subject of some interest in the context of "robust PCA" [HI [33] and recovering the structure of graphical 
models with latent variables [ID]. As with the trace-norm regularized variant, the $\ell_1$ + max-norm problem 
can be formulated as an SDP and solved using standard solvers, but this is only applicable to fairly small scale 
problems. In Section 4 we discuss various optimization approaches to this problem, including approaches 
which preserve the sparsity of the solution. 



1.1 Relationship to the Goemans-Williamson SDP Relaxation 

Our convex relaxation approach is related to the classic SDP relaxations of max-cut [13] and more generally 
the cut-norm [2]. In fact, if we are interested in a partition to exactly two clusters, the correlation clustering 
problem is essentially a max-cut problem, though with both positive and negative weights (i.e. a symmetric 
cut-norm problem), and our relaxation is essentially the classic SDP relaxation of these problems. Our 
approach and results differ in several ways. 

First, we deal with problems with multiple clusters, and even when the number of clusters is not pre-determined. 
If the number of clusters $k$ is pre-determined, the correlation clustering problem can be written 
as an integer quadratic program, with $k$ variables per node, and can be relaxed to an SDP. But this SDP 
will be very different from ours, and will involve a matrix of size $nk \times nk$, unlike our relaxation where the 
matrix is of size $n \times n$ regardless of the number of clusters. Consequently, the rounding techniques based on 
(random) projections typically employed for classic SDP relaxations do not seem relevant here. Instead, we 
employ a single-linkage post-processing as a form of "rounding" imperfect solutions. 

Second, the type of guarantees we provide are very different from those in the Theory of Computation 
literature. Most of the SDP relaxation work we are aware of (including the classical work cited above) focuses 
on worst case constant factor approximation guarantees. On one hand, this means the guarantee needs to 
hold even on "crazy" inputs where there is really no reasonable clustering anyway, and on the 
other hand it is not clear how approximating the objective to within a constant factor translates to recovering 
an underlying clustering. Instead, we prove that when the affinity matrix is close enough to following some 
underlying "true" clustering, the true clustering will be recovered exactly. This type of guarantee is more 
in the spirit of compressed sensing, where exact recovery of a support set is guaranteed subject to 
conditions on the input [16]. 

1.2 Other Clustering Approaches 

There are several classes of clustering algorithms with different objectives. In hierarchical clustering algorithms 
such as UPGMA [28], SLINK [27] and CLINK [IT], the goal is to produce a sequence of clusterings 
by merging or splitting two clusters at each step of the sequence according 
to a local disagreement objective, as opposed to our global $D(\mathcal{C})$. Because of this locality, these methods are 
known to be very sensitive to outliers. 

Cut-based clustering algorithms such as k-means/medians J3TJ [15], ratio association [26], ratio cut [9] 
and normalized cut [34] try to optimize an objective function globally. The main issue with these objectives 
is that they are typically NP-Hard and need to know the number of clusters ahead of time, since these 
objectives are monotone in the number of clusters. 

In contrast, spectral clustering algorithms [32] try to find the first $k$ principal components of the affinity 
matrix or a transformed version of it [24]. These methods require the number of clusters in advance and 
have been shown to be tractable (convex) relaxations of NP-Hard cut-based algorithms [12]. These methods 
are again very sensitive to outliers, as outliers might change the principal components dramatically. 

2 Problem Setup 

Our approach is based on representing a clustering $\mathcal{C}$ through its incidence matrix $K(\mathcal{C}) \in \mathbb{R}^{n \times n}$, where 
$K_{uv} = 1$ iff $u$ and $v$ belong to the same cluster in $\mathcal{C}$ (i.e. $u, v \in C_i$ for some $i$), and $K_{uv} = 0$ otherwise (i.e. if 
$u$ and $v$ belong to different clusters). The matrix $K(\mathcal{C})$ is thus a permuted block-diagonal matrix, and can 
also be thought of as the edge incidence matrix of a graph with cliques corresponding to clusters in $\mathcal{C}$. We 
will say that a matrix $K$ is a valid clustering matrix, or sometimes simply valid, if it can be written 
as $K = K(\mathcal{C})$ for some clustering $\mathcal{C}$ (i.e. if it is a permuted block-diagonal matrix, with 1s in the diagonal 
blocks). 
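The incidence matrix is easy to build from a labeling. The sketch below (the helper name `clustering_matrix` is ours) constructs $K(\mathcal{C})$ and shows that sorting the nodes by cluster label exposes the block-diagonal structure.

```python
def clustering_matrix(labels):
    """Return K(C): K[u][v] = 1 if u and v share a label, else 0."""
    n = len(labels)
    return [[1 if labels[u] == labels[v] else 0 for v in range(n)]
            for u in range(n)]


labels = [0, 1, 0, 1]                 # clusters {0, 2} and {1, 3}
K = clustering_matrix(labels)
# K == [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]

# Reordering nodes by label reveals the block-diagonal form.
order = sorted(range(len(labels)), key=lambda u: labels[u])
K_sorted = [[K[u][v] for v in order] for u in order]
# K_sorted == [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
```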

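The disagreement can then be written as either: 

$$D(K) = \sum_{u,v} |A_{uv} - K_{uv}| = \|A - K\|_1 \qquad (1)$$

or as: 

$$D(K) = \sum_{u,v} A_{uv} + \sum_{u,v} (1 - 2A_{uv})\, K_{uv} \qquad (2)$$

where the term $\sum_{u,v} A_{uv}$ does not depend on the clustering $\mathcal{C}$ and can thus be dropped. 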

We now phrase the correlation clustering problem as a matrix problem, where we would like to solve 

$$\min_K D(K) \quad \text{s.t. } K \text{ is a valid clustering matrix.} \qquad (3)$$

The problem is that even though the objectives (1) and (2) are convex, the constraint that $K$ is valid 
is certainly not convex. Our approach to correlation clustering will thus be to relax this non-convex 
constraint (the validity of $K$) to a convex constraint. 

We note that although both the absolute error objective (1) and the linear objective (2) agree on valid 
clustering matrices (or more generally, on binary matrices $K$), they can differ when $K$ is fractional, and 
especially when $A$ is also fractional. The choice of objective can thus be important when relaxing the 
validity constraint to a convex constraint. More specifically, as long as $A$ is binary (i.e. $A_{uv} \in \{0, 1\}$) and 
$0 \le K_{uv} \le 1$, even if $K$ is fractional, the two objectives agree. Non-negativity of $K_{uv}$ is ensured in some, 
but not all, of the convex relaxations we study. When non-negativity is not ensured, the absolute error 
objective (1) would tend to avoid negative values, but the linear objective might well prefer them. More 
importantly, once the affinities $A_{uv}$ are also fractional, the two objectives differ even for $0 \le K_{uv} \le 1$. While 
the linear objective would tend to not care much about entries with affinities close to 1/2, the absolute error 
objective would tend to encourage fractional values in these cases. 

The linear objective also has some optimization advantages over the absolute error objective. From a 
numerical optimization point of view, dealing with the linear objective function is easier since we do not need 
to compute the sub-gradients of the $\ell_1$-norm. 
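The agreement and disagreement of the two objectives can be checked numerically. The sketch below assumes the linear objective takes the form $\sum_{u,v} A_{uv} + \sum_{u,v} (1 - 2A_{uv}) K_{uv}$, which is the linear function of $K$ that coincides with the disagreement on binary $K$; the function names are ours.

```python
def absolute_objective(A, K):
    """Absolute error objective: sum over all u, v of |A_uv - K_uv|."""
    return sum(abs(a - k) for row_a, row_k in zip(A, K)
               for a, k in zip(row_a, row_k))


def linear_objective(A, K):
    """Assumed linear form: sum of A_uv plus sum of (1 - 2 A_uv) K_uv."""
    constant = sum(a for row in A for a in row)
    linear = sum((1 - 2 * a) * k for row_a, row_k in zip(A, K)
                 for a, k in zip(row_a, row_k))
    return constant + linear


A = [[1.0, 0.75], [0.75, 1.0]]             # fractional affinities
K_binary = [[1, 1], [1, 1]]                # a valid clustering matrix
K_fractional = [[1.0, 0.75], [0.75, 1.0]]  # fractional, equal to A

# On the binary K both objectives give 0.5.
print(absolute_objective(A, K_binary), linear_objective(A, K_binary))
# On the fractional K they differ: 0.0 versus 0.75.
print(absolute_objective(A, K_fractional), linear_objective(A, K_fractional))
```

In the second case the absolute error objective is minimized by copying the fractional affinities, while the linear objective still charges for them, in line with the discussion above.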



3 Max-Norm Relaxation 

As discussed in the previous Section, we are interested in optimizing over the non-convex set of valid clustering 
matrices. The approach we discuss here is to relax this set to the set of matrices with bounded max-norm 
[30]. The max-norm of a matrix $K$ is defined as 

$$\|K\|_{\max} = \min_{K = RL^T} \|R\|_{\infty,2} \, \|L\|_{\infty,2}$$

where $\|\cdot\|_{\infty,2}$ is the maximum of the $\ell_2$ norms of the rows, and the minimization is over factorizations of any 
internal dimensionality. It is not hard to see that if $K$ is a valid clustering matrix, with $K = K(\mathcal{C})$, then 
$\|K\|_{\max} = 1$. This is achieved, e.g., by a factorization with $R = L$, and where each row $R_u$ of $R$ is a (unit 
norm) indicator vector with $R_{ui} = 1$ for $u \in C_i$ and zero elsewhere. 
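The indicator factorization can be verified directly. The following sketch (helper names are ours; labels are assumed to be integers $0, \ldots, k-1$) builds $R$, checks that $RR^T$ is the clustering matrix, and evaluates the row-norm bound that certifies a max-norm of at most one.

```python
import math


def indicator_factor(labels):
    """Row u of R is the indicator vector of the cluster containing u."""
    k = max(labels) + 1
    return [[1.0 if labels[u] == i else 0.0 for i in range(k)]
            for u in range(len(labels))]


def max_row_norm(R):
    """The (infinity, 2) norm: largest l2 norm among the rows of R."""
    return max(math.sqrt(sum(x * x for x in row)) for row in R)


def gram(R):
    """Return the product R R^T."""
    return [[sum(a * b for a, b in zip(ru, rv)) for rv in R] for ru in R]


labels = [0, 1, 0, 1, 2]          # clusters {0, 2}, {1, 3}, {4}
R = indicator_factor(labels)
K = gram(R)                       # entry (u, v) is 1 iff same cluster
bound = max_row_norm(R) ** 2      # upper bound on the max-norm of K
```

Here `bound` evaluates to 1.0, since every row of `R` has exactly one non-zero entry equal to one.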

Relaxing the validity constraint to a max-norm constraint, and using the absolute error objective, we 
obtain the following convex relaxation of the correlation clustering problem: 

$$\hat{K} = \arg\min_K \|A - K\|_1 \quad \text{s.t. } \|K\|_{\max} \le 1. \qquad (4)$$

Alternatively, we could have used the linear objective (2) instead. In any case, after finding $\hat{K}$, it is easy 
to check whether it is valid, and if so recover the clustering from its block structure. If $\hat{K}$ is valid, we are 
assured the corresponding clustering is a globally optimal solution of the correlation clustering problem. 
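One possible way to carry out this validity check is sketched below. The function name `recover_clustering` and the tolerance `tol` used to absorb numerical solver error are assumptions of this example. Nodes with identical rows are grouped together, and the matrix is accepted only if the grouping reproduces it exactly.

```python
def recover_clustering(K, tol=1e-6):
    """Return cluster labels if K is (numerically) a valid clustering
    matrix, and None otherwise."""
    n = len(K)
    rows = []
    for row in K:
        rounded = []
        for x in row:
            if abs(x) <= tol:
                rounded.append(0)
            elif abs(x - 1) <= tol:
                rounded.append(1)
            else:
                return None          # fractional entry: not valid
        rows.append(tuple(rounded))
    # In a valid matrix, row u is the indicator vector of u's cluster,
    # so identical rows identify clusters.
    label_of_row = {}
    labels = [label_of_row.setdefault(r, len(label_of_row)) for r in rows]
    for u in range(n):
        for v in range(n):
            same = 1 if labels[u] == labels[v] else 0
            if rows[u][v] != same:
                return None          # binary but not block structured
    return labels


print(recover_clustering([[1, 0, 1], [0, 1, 0], [1, 0, 1]]))   # [0, 1, 0]
print(recover_clustering([[1, 1, 0], [1, 1, 1], [0, 1, 1]]))   # None
```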



3.1 Theoretical Guarantee 

Assuming there exists an underlying true clustering, we provide a worst-case (deterministic) guarantee for 
exact recovery of that clustering in the presence of noise when the affinity matrix $A$ is a binary 0-1 matrix, 
using the absolute objective. The flavor of our result is similar to [16] for the trace-norm, except that we show 
the max-norm constrained problem recovers the underlying clustering under larger noise compared to the trace-norm 
constraint. This matches our intuition that the max-norm is a tighter relaxation than the trace-norm for valid 
clustering matrices. 

To present our theoretical result, we start by introducing an important quantity that our main result is 
based upon. Suppose $\mathcal{C}^* = \{C_1^*, \ldots, C_{k^*}^*\}$ is the underlying true clustering. For a node $u$ and a cluster $C_i^*$, 
let 

$$d_{u,C_i^*} = \frac{|\{v \in C_i^* : A_{uv} = 0\}|}{|C_i^*|} \ \text{ if } u \in C_i^*, \qquad d_{u,C_i^*} = \frac{|\{v \in C_i^* : A_{uv} = 1\}|}{|C_i^*|} \ \text{ otherwise,}$$

and let 

$$D_{\max}(A, K) = D_{\max}(A, K(\mathcal{C}^*)) = \max_{u,i} d_{u,C_i^*}$$

be the maximum of the disagreement ratios on the adjacency matrix. This definition is inspired by [16] but 
is slightly different. Notice that the larger $D_{\max}(A, K)$ is, the more noisy (compared to ideal clusters) the 
graph is; and hence, the harder the clustering becomes. In particular for ideal clusters (fully connected inside 
and fully disconnected outside clusters), we have $D_{\max}(A, K) = 0$. 

Figure 1: Guarantee region of Theorem 1: the noise level $D_{\max}$ vs the unbalanceness parameter. 

We would like to ensure that when $D_{\max}(A, K)$ is small enough, our method can recover $K$. The following 
lemma helps us understand the information theoretic limit of $D_{\max}(A, K)$, i.e. what value of $D_{\max}$ is certainly 
not small enough to ensure recovery, even information theoretically: 

Lemma 1. For any clustering C = {Ci, . . . , C&} and for all 7 > with r = ^ n \c-\ 2 7 there exists an 
affinity matrix A such that D max (A, K(C)) = 7 and the combinatorial program Q does not output C. 

Note that the minimum of is attained when all clusters have equal sizes. If we have k* clusters of 
size pr, then r = k* and the bound in Lemma [l] asserts that if D max (A, K) > , then there are examples 
for which the original clustering cannot be recovered by the combinatorial program ([3|. This implies that 
Anax(A K) cannot be scaled better than O(^r) in general even without convex relaxation. 

Suppose there exists a true underlying clustering $\mathcal{C}^*$ with $k^*$ clusters. Let $C_{\min}$ be the smallest 
underlying true cluster, and suppose we are given an affinity matrix $A$ with $D_{\max} = D_{\max}(A, K(\mathcal{C}^*))$. Introducing a 
Lagrange multiplier $\mu$, we consider the optimization problem 

= arg min ^ \\A - + /j, \\K || max . (5) 
k n 

The following theorem characterizes the noise regime under which the simple max-norm relaxation (5) recovers 
$\mathcal{C}^*$. 

Theorem 1. For binary — 1 matrix A, if D max < -^q^ is small enough to satisfy ^ J2i ( jjr^ | ) — 

(1+dSdL then > f° r an V ^ satis fv in 9 (i-3otS)h5Ll a < < SwP' the matnx (the 

solution to (5)) is unique and equal to the matrix $K^* = K(\mathcal{C}^*)$ (the solution to (3)). 

Remark 1: Consider the parameter ^- J2 i ^ ^\ ■ | ^ in the theorem. Notice that for a balanced underlying 

clustering ($k^*$ clusters of size $n/k^*$), this parameter is 1 and as the underlying clustering gets more and 
more unbalanced, this parameter increases. This motivates calling it the unbalanceness of the clustering. It is 
clear that as the unbalanceness parameter increases, the region of $D_{\max}$ for which our theorem guarantees 
clustering recovery shrinks. We plot the admissible region of $D_{\max}$ as a function of unbalanceness in Fig. 1. 
Remark 2: According to Lemma 1, the bound on $D_{\max}$ is order-wise tight and can only be improved 
by a constant in general. 

3.2 Comparison to Single-Linkage Algorithm 

Considering the single-linkage algorithm (SLINK) [27] as a baseline for clustering, we compare the power of our 
algorithm in cluster recovery with it. SLINK generates a hierarchy of clusterings starting with each node 
as a cluster. At each iteration, SLINK measures the similarity of all pairs of clusters and combines the most 
similar pair of clusters into a new cluster. We consider the closeness of the columns $A_i$ and $A_j$ as the similarity 
measure of nodes $i$ and $j$. 

Figure 2: Starting from two ideal clusters of size 18, we disconnect node A from 4 nodes on the left cluster 
and connect it to 11 nodes on the right cluster. Moreover, we disconnect node B from 4 nodes on the right 
cluster and connect it to 7 nodes on the left cluster as shown. The optimal clustering according to (3) is still 
the original two clusters. 

Consider the graph shown in Fig. 2. With exhaustive search, one can show that the non-convex problem 
(3) outputs two clusters as shown. Running SLINK on this graph, the algorithm first finds two cliques of 
size 17 and nodes A and B as four separate clusters in the hierarchy. Next, it combines nodes A and B into a 
separate cluster since they are more similar to each other than to their own clusters. This means that the 
single-linkage algorithm will never find the correct clustering. However, it can be easily checked that our proposed 
max-norm constrained algorithm will recover the solution of (3). 



3.3 Comparison to Trace-Norm Constrained Clustering 

Since the max-norm constraint is a strictly tighter relaxation than the trace-norm constraint, we expect the 
max-norm algorithm to perform better. Our theorem shows improvement over the guarantees provided 
for trace-norm clustering. Comparing to the result of [16 on trace- norm (D max < ^^), the max-norm 



tolerates more noise. To see this, consider a balanced clustering, then trace-norm requires D max < and 
max-norm requires D max < min( k } +1 , 0.1789) which is larger than for all k* . The difference gets more 
clear for unbalanced clustering. Suppose we have one small cluster of constant size |C m i n | and other clusters 
are approximately of size j^. As (n, k*) scales, trace-norm guarantee requires that D max = o(^) which is 
inverse proportional to the size of the smallest cluster, whereas, max-norm guarantee requires D max = °(^r) 
which is inverse proportional to the size of the largest cluster. This is a huge theoretical advantage in our 
theorem. 

Besides comparing the provided guarantees, we compare max-norm clustering with trace-norm clustering 
both deterministically and probabilistically. Running trace-norm constrained minimization [16] on the graph 
shown in Fig. 2, the resulting clustering consists of two clusters and node B belongs to the correct cluster. 
However, node A belongs to both clusters: the clustering matrix contains two blocks of ones and the 
row/column corresponding to node A contains all ones. Also, the diagonal entry corresponding to node 
A is larger than one and the diagonal entry corresponding to node B is less than one. In short, this 
algorithm is confused as to which cluster node A belongs to. 

Further, we compare our algorithm with the trace-norm algorithm [16] and SLINK in a probabilistic setup. 
We start from two different ideal clusterings on 100 nodes: a) Balanced clusters: four ideal clusters of size 25, b) 
Unbalanced clusters: three ideal clusters of size 30 and one ideal cluster of size 10. We then gradually increase 
$D_{\max}$ on both graphs, run all algorithms and report the probability of success in exact recovery of the 
underlying clusters. Although our theoretical guarantee is for binary affinity matrices, here we also run the same 
experiment for fractional affinity matrices. We run all experiments for both absolute and linear objectives. 

Fig. 3 shows that in all cases max-norm outperforms the trace-norm and the improvement is more significant 
for unbalanced clustering with a fractional affinity matrix. Moreover, these experiments reveal that the absolute 
objective has a slight advantage if the affinity matrix is binary and clusters are balanced; otherwise, the linear 
objective is better. 



[Figure 3 panels: (a) Balanced; Binary, (b) UnBalanced; Binary, (c) Balanced; Fractional, (d) UnBalanced; Fractional. Horizontal axis: noise level $D_{\max}$. Curves: Max-Abs, Max-Lin, Trace-Abs, Trace-Lin, SLINK and Enhanced Algo, with the Theorem 1 threshold marked.] 

Figure 3: Probability of exact clustering recovery for max-norm and trace-norm constrained algorithms under 
the absolute $\|A - K\|_1$ and linear (2) objectives. There are 4 clusters of size 25 for the balanced 
case and three clusters of size 30 plus one cluster of size 10 for the unbalanced case. We consider two cases 
for each graph: where the affinity matrix is binary and where it is not. We show the results both for the simple 
max-norm relaxation (basic algorithm) and for the tighter relaxations presented in Section 5 (enhanced algorithm). 
The result shows that max-norm constrained optimization recovers the exact clustering matrix under higher 
noise regimes better than the trace-norm and single-linkage algorithms. Also, the linear objective seems to be 
performing better than the absolute objective for the clustering problem in most cases. 



4 Max-norm + $\ell_1$-norm Optimization 

In this Section we consider optimization problems of the form (4). This problem recovers a sparse and 
low-rank matrix from their sum, considering the max-norm as a proxy for rank. In Section 4.1, we discuss how 
(4) can be formulated as an SDP, allowing us to easily solve it using standard SDP solvers, as long as the 
problem size is relatively small. We then propose three other methods to numerically solve the optimization 
problem (4). 



4.1 Semi-Definite Programming Method 

Following Srebro et al. [30], we introduce dummy variables $L, R \in \mathbb{R}^{n \times n}$ and reformulate (4) as the following 
SDP problem 

$$\hat{K} = \arg\min_{K,L,R} \|A - K\|_1 \quad \text{s.t. } \begin{bmatrix} L & K \\ K^T & R \end{bmatrix} \succeq 0 \ \text{ and } \ L_{ii}, R_{ii} \le 1.$$

These constraints are equivalent to the condition $\|K\|_{\max} \le 1$. This SDP can be solved using generic SDP 
solvers, though this is very slow and does not scale to large problems. 



4.2 Factorization Method 

Motivated by Lee et al. [20], we introduce dummy variables $L, R \in \mathbb{R}^{n \times n}$ and let $K = LR^T$. With this 
change of variable, we can reformulate (4) as 

$$\hat{K} = \hat{L}\hat{R}^T = \arg\min_{L,R} \|A - LR^T\|_1 \quad \text{s.t. } \|L\|_{\infty,2}, \|R\|_{\infty,2} \le 1.$$

This problem is not convex, but it is guaranteed to have no local minima for large enough size of the problem 
[7]. Furthermore, if we know the optimal solution $\hat{K}$ has rank at most $r$, we can take $L, R$ to be in $\mathbb{R}^{n \times (r+1)}$. 
In practice, we truncate to some reasonably high rank $r$ even without a known guarantee on the rank of the 
optimal solution. To solve this problem iteratively, Lee et al. [20] suggest the following update 

$$\begin{bmatrix} L \\ R \end{bmatrix}_{k+1} = \mathcal{P}_{\max}\left( \begin{bmatrix} L \\ R \end{bmatrix}_k + \frac{\tau}{\sqrt{k}} \begin{bmatrix} \mathrm{Sign}(A - LR^T)\, R \\ \mathrm{Sign}(A - LR^T)^T L \end{bmatrix}_k \right).$$

The projection $\mathcal{P}_{\max}(\cdot)$ operates on rows of $L$ and $R$; if the $\ell_2$-norm of a row is less than one, it remains unchanged, 
otherwise it is rescaled so that the $\ell_2$-norm becomes one. 

A possible problem with the above formulation is the lack of "sparsity" in the following sense: The $\ell_1$ 
objective is likely to yield an optimal solution $K^*$ with many zeros in $A - K^*$, i.e. where $K^*$ is exactly 
equal to $A$ on some of the entries. However, gradient steps on the factorization are not likely to end up in 
exactly sparse solutions, and we are not likely to see any such sparsity in solutions obtained by the above 
method. 
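A minimal sketch of the factorization method is given below. It assumes a diminishing step size $\tau / \sqrt{k}$, a random initialization, and a user-chosen `rank` for the factors; these choices, and all function names, are assumptions of the example rather than details fixed by the text. Each iteration takes a subgradient step on $\|A - LR^T\|_1$ and then rescales any row of $L$ or $R$ whose $\ell_2$-norm exceeds one.

```python
import math
import random


def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def project_rows(X):
    """Rescale every row whose l2 norm exceeds one back to norm one."""
    out = []
    for row in X:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row] if norm > 1.0 else list(row))
    return out


def sign(x):
    return (x > 0) - (x < 0)


def factorization_method(A, rank, iters=500, tau=1.0, seed=0):
    """Projected subgradient iterations on ||A - L R^T||_1."""
    rng = random.Random(seed)
    n = len(A)
    L = project_rows([[rng.gauss(0, 1) for _ in range(rank)] for _ in range(n)])
    R = project_rows([[rng.gauss(0, 1) for _ in range(rank)] for _ in range(n)])
    for k in range(1, iters + 1):
        K = matmul(L, transpose(R))
        S = [[sign(A[u][v] - K[u][v]) for v in range(n)] for u in range(n)]
        step = tau / math.sqrt(k)
        dir_L = matmul(S, R)              # Sign(A - L R^T) R
        dir_R = matmul(transpose(S), L)   # Sign(A - L R^T)^T L
        L = project_rows([[x + step * d for x, d in zip(rx, rd)]
                          for rx, rd in zip(L, dir_L)])
        R = project_rows([[x + step * d for x, d in zip(rx, rd)]
                          for rx, rd in zip(R, dir_R)])
    return L, R


A = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
L, R = factorization_method(A, rank=2, iters=200)
K_hat = matmul(L, transpose(R))   # every entry lies in [-1, 1]
```

Because both factors have rows of norm at most one, every iterate satisfies the max-norm constraint by construction; the sketch makes no claim about how quickly the iterates approach the optimum.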



4.3 Loss Function Method 

There are gradient methods such as truncated gradient [18] that produce sparse solutions; however, these 
methods cannot be applied to this problem. We introduce a surrogate optimization problem to (4) by adding 
a loss function. For some large $\lambda \in \mathbb{R}$, solve 

$$\hat{K} = \arg\min_{Z,L,R} \|Z\|_1 + \lambda \|A - Z - LR^T\|^2 \quad \text{s.t. } \|L\|_{\infty,2}, \|R\|_{\infty,2} \le 1.$$

Here, the matrix $Z$ is sparse and includes the disagreements. For sufficiently large values of $\lambda$, the loss 
function ensures that the matrix $A - Z$ is close to the matrix $LR^T$, which is a bounded max-norm matrix. To 
solve this problem iteratively, we use the following update 



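$$Z_{k+1} = \mathcal{P}_{\ell_1}\left( Z_k + \frac{\tau\lambda}{\sqrt{k}} (A - Z - LR^T)_k \right)$$

$$\begin{bmatrix} L \\ R \end{bmatrix}_{k+1} = \mathcal{P}_{\max}\left( \begin{bmatrix} L \\ R \end{bmatrix}_k + \frac{\tau\lambda}{\sqrt{k}} \begin{bmatrix} (A - Z - LR^T)\, R \\ (A - Z - LR^T)^T L \end{bmatrix}_k \right)$$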

Here, $\mathcal{P}_{\ell_1}(\cdot)$ operates on entries; if an entry has the same sign before and after the update, it remains 
unchanged; otherwise, it is set to zero. Solving directly for large values of $\lambda$ might cause some problems 
due to the finite numerical precision. In practice, we start with some small value, say $\lambda = 1$, and double the 
value of $\lambda$ after some iterations. This way, we gradually put more and more emphasis on the loss function as 
we get closer to the optimal point. 
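The entrywise truncation and the doubling schedule can be sketched as follows. Two details are assumptions of this example: an entry that is exactly zero before the update is allowed to become non-zero (only a strict sign flip is truncated), and the doubling period `every` is a free parameter. The function names are ours.

```python
def truncate_sign_flips(Z_before, Z_after):
    """Entrywise operator: zero out entries whose sign flipped in the update.

    Assumption: an entry that was exactly zero before the update may move
    away from zero; only a strict sign flip is truncated.
    """
    return [[0.0 if b * a < 0 else a for b, a in zip(row_b, row_a)]
            for row_b, row_a in zip(Z_before, Z_after)]


def penalty_weight(iteration, every=100, start=1.0):
    """Weight of the loss term: start value, doubled once per `every` steps."""
    return start * 2 ** (iteration // every)


Z_before = [[0.5, -0.2], [0.0, 0.3]]
Z_after = [[0.4, 0.1], [0.2, -0.1]]
print(truncate_sign_flips(Z_before, Z_after))  # [[0.4, 0.0], [0.2, 0.0]]
print(penalty_weight(0), penalty_weight(250))  # 1.0 4.0
```

Setting flipped entries to exactly zero is what lets the iterates of $Z$ be exactly sparse, in contrast with plain gradient steps.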



4.4 Dual Decomposition Method 

Inspired by Rockafellar [25], we first reformulate (4) by introducing a dummy variable $Z \in \mathbb{R}^{n \times n}$ as follows 

$$\hat{K} = \arg\min_{Z,K} \|A - K\|_1 \quad \text{s.t. } \|Z\|_{\max} \le 1 \text{ and } Z = K.$$

Then, introducing a Lagrange multiplier $\Lambda \in \mathbb{R}^{n \times n}$, we propose the following equivalent problem: 

$$\hat{K} = \arg\max_{\Lambda} \min_{Z,K} \|A - K\|_1 + \langle\langle \Lambda, K - Z \rangle\rangle \quad \text{s.t. } \|Z\|_{\max} \le 1.$$

Here, $\langle\langle \cdot, \cdot \rangle\rangle$ is the trace of the product. This problem is a saddle-point convex problem in $(Z, K, \Lambda)$. To solve 
this, we iteratively fix $\Lambda$ and optimize over $(K, Z)$ and then, using those optimal values of $(K, Z)$, update $\Lambda$. 
For a fixed $\Lambda$, the problem can be separated into two optimization problems over $K$ and $Z$ as 

$$K(\Lambda) = \arg\min_K \|A - K\|_1 + \langle\langle \Lambda, K \rangle\rangle$$

which is a soft thresholding: if $|\Lambda_{ij}| > 1$ then $K(\Lambda)_{ij} = -\mathrm{Sign}(\Lambda_{ij})$; otherwise $K(\Lambda)_{ij} = A_{ij}$, and 

$$Z(\Lambda) = \arg\min_Z -\langle\langle \Lambda, Z \rangle\rangle \quad \text{s.t. } \|Z\|_{\max} \le 1,$$

which can be solved using the factorization method discussed above. 
Using $K(\Lambda_k)$ and $Z(\Lambda_k)$, we update $\Lambda$ as follows 

$$\Lambda_{k+1} = \Lambda_k - \frac{\tau}{\sqrt{k}} \left( K(\Lambda_k) - Z(\Lambda_k) \right)$$

until it converges. One criterion for the convergence of this method is to round both matrices $K$, $Z$ and check 
if they are equal. To use this criterion, we need to initialize the two matrices very differently to avoid 
stopping due to the initialization. 

Figure 4: Comparison of the proposed numerical optimization methods in terms of the sparsity of the solution 
they provide and the $\ell_1$ error of the estimation. 

4.5 Numerical Comparison 

We compare the performance of these methods. For three ideal clusters of size 20 with noise level $D_{\max}$, we 
run all three algorithms for 2000 iterations. We use an initial step size $\tau = 1$ for all methods, and, for 
the loss function method, we double $\lambda$ every 100 iterations. For the dual method, we update $\Lambda$ 20 times 
and run 100 iterations of the factorization method for the max-norm sub-problem at each update. We report 
the sparsity of the solution $A - \hat{K}$ as well as the $\ell_1$-norm of the error $\|\hat{K} - K^*\|_1$ for each algorithm in Fig. 4. 
This result shows that there is a trade-off between sparsity and the error: the dual optimization method 
consistently provides a sparse solution, whereas the factorization and loss function methods provide small error. 
The sparsity of the loss function method gets worse as the noise increases. 



5 Tighter Relaxations 

In this section, we improve our basic algorithm in two ways: first, we use a tighter relaxation for the valid 
clustering constraint and second, we add a single-linkage step after we recover the clustering matrix. 
Although the max-norm is a tighter relaxation compared to the trace-norm, we would like to go further and introduce 
tighter relaxations. Figure 5 summarizes different possible relaxations based on the max-norm. The arrows in 
this figure indicate the strict subset relations among these relaxations. The tightest relaxation we suggest 
is $\{K = RR^T : \|R\|_{\infty,2} \le 1, R \ge 0\}$, based on the intuition that a clustering matrix is symmetric and has a 
trivial factorization $R \in \mathbb{R}^{n \times k}$, where $R_{ij}$ is non-zero if node $i$ belongs to cluster $j$. The next lemma formalizes 
this result. 

[Figure 5 diagram, as far as recoverable: $\{K : \|K\|_{\max} \le 1, K \succeq 0, K \ge 0\}$, linked by arrow (3) to $\{K = RR^T : \|R\|_{\infty,2} \le 1, R \ge 0\}$.] 

Figure 5: Summary of possible convex relaxations of the set of valid clustering matrices and their relations. 
Here, $\|\cdot\|_*$ represents the trace (nuclear) norm, $\|\cdot\|_{\infty,2}$ represents the maximum $\ell_2$ norm of the rows, "$\ge$" is 
used for element-wise positiveness and "$\succeq$" is used for positive semi-definiteness. Each double-ended arrow 
represents the equivalence of two sets. Each single-ended arrow in this figure represents a strict subset 
relation between two sets. 

Lemma 2. All relaxation sets shown in Fig. 5 are convex and the strict subset relations hold. 

This suggests using the tightest convex relaxation, that is constraining to $K$ such that there exists 
$R \ge 0$, $\|R\|_{\infty,2} \le 1$ with $K = RR^T$ (the set of matrices $K$ with a factorization $K = RR^T$, $R \ge 0$ is 
called the set of completely positive matrices and is convex [5]). We optimize over this relaxation by solving 
the following optimization problem over $R$: 

$$\hat{R} = \arg\min_R \|A - RR^T\|_1 \quad \text{s.t. } \|R\|_{\infty,2} \le 1 \ \&\ R \ge 0, \qquad (6)$$

and setting $\hat{K} = \hat{R}\hat{R}^T$. Although the constraint on $K$ is convex, the optimization problem (6) is not convex 
in $R$. 
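If problem (6) is attacked with projected (sub)gradient steps on $R$, analogous to the factorization method of Section 4.2 (an assumption of this example, since no solver for (6) is fixed here), the only new ingredient is the projection onto $\{R : R \ge 0, \|R\|_{\infty,2} \le 1\}$. A sketch of that projection, with a hypothetical function name, is:

```python
import math


def project_nonneg_unit_rows(R):
    """Project each row onto {r >= 0, ||r||_2 <= 1}.

    Negative entries are clipped to zero first, and the row is then
    rescaled if its l2 norm still exceeds one.
    """
    out = []
    for row in R:
        clipped = [max(x, 0.0) for x in row]
        norm = math.sqrt(sum(x * x for x in clipped))
        out.append([x / norm for x in clipped] if norm > 1.0 else clipped)
    return out


R = [[3.0, -4.0], [0.3, 0.4]]
print(project_nonneg_unit_rows(R))  # [[1.0, 0.0], [0.3, 0.4]]
```

Any $R$ produced this way yields $K = RR^T$ with non-negative entries and diagonal entries at most one.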

5.1 Single-linkage Post Processing 

The matrix $\hat{K}$ extracted from (6) might deviate from a valid clustering matrix in two ways: firstly, it might 
not have the structure of a valid clustering and secondly, even if it has the structure, the values might not 
be integer. We run SLINK on $\hat{K}$ as a "rounding scheme" to fix both of the above problems. SLINK gives a 
sequence of clusterings $\mathcal{C}_1, \ldots, \mathcal{C}_n$. To pick the best clustering, we choose 

$$\hat{K} = \arg\min_{i} \|A - K(\mathcal{C}_i)\|_1. \qquad (7)$$

The matrix $\hat{K}$ can be viewed as a refined version of the affinity matrix $A$ and hence the second step of the 
algorithm can be replaced by other hierarchical clustering algorithms. The criterion of choosing the best 
clustering in the hierarchy comes naturally from the correlation clustering formulation. 
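The rounding step can be sketched as follows. This is a naive $O(n^3)$ single-linkage procedure, not the optimized SLINK algorithm, and it assumes the entries of the relaxed solution are used directly as similarities; function names are ours. Every clustering in the hierarchy is scored against $A$ with the $\ell_1$ criterion, and the best one is returned.

```python
def labels_of(clusters, n):
    """Convert a list of clusters (lists of nodes) into one label per node."""
    labels = [0] * n
    for c, members in enumerate(clusters):
        for u in members:
            labels[u] = c
    return labels


def single_linkage_hierarchy(S):
    """Clusterings from n singletons down to one cluster, merging at each
    step the two clusters containing the most similar pair of nodes."""
    n = len(S)
    clusters = [[u] for u in range(n)]
    hierarchy = [labels_of(clusters, n)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                link = max(S[u][v] for u in clusters[i] for v in clusters[j])
                if best is None or link > best[0]:
                    best = (link, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        hierarchy.append(labels_of(clusters, n))
    return hierarchy


def l1_disagreement(A, labels):
    """||A - K(C)||_1 for the clustering C encoded by labels."""
    n = len(A)
    return sum(abs(A[u][v] - (1 if labels[u] == labels[v] else 0))
               for u in range(n) for v in range(n))


def round_by_single_linkage(A, K_relaxed):
    """Pick the clustering in the hierarchy of K_relaxed that best fits A."""
    hierarchy = single_linkage_hierarchy(K_relaxed)
    return min(hierarchy, key=lambda labels: l1_disagreement(A, labels))


A = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
K_relaxed = [[1.0, 0.9, 0.1, 0.0],
             [0.9, 1.0, 0.2, 0.1],
             [0.1, 0.2, 1.0, 0.8],
             [0.0, 0.1, 0.8, 1.0]]
print(round_by_single_linkage(A, K_relaxed))  # [0, 0, 1, 1]
```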

5.2 Comparison with Other Algorithms 

We compare our enhanced algorithm with the trace-norm algorithm [16] followed by SLINK and SLINK itself. 
In all cases we pick a clustering from the SLINK hierarchy using (7). The setup is identical to the experiment 
explained in Section 3.3. Fig. 3 summarizes the results and shows that our enhanced algorithm outperforms 
all competing methods significantly. 

Figure 6: Comparison of our best proposed method, which is the linear objective over the tight relaxation (followed 
by a single-linkage algorithm), with its trace-norm counterpart, the single-linkage algorithm and spectral clustering. 
Here, we plot the entropy-based distance of the recovered clustering from the underlying true clustering. 

Figure 7: Comparison of our best proposed method, which is the linear objective over the tight relaxation (followed by 
k-means), with trace-norm and spectral clustering in terms of time complexity and clustering error on the MNIST dataset. 

Besides the exact recovery of the underlying clustering, we would like to investigate how bad the output of our 
algorithm gets as the noise level $D_{\max}$ increases. Using "variation of information" [23] as a distance 
measure for clusterings, we compare our algorithm with the linear objective against its trace-norm counterpart, SLINK 
and spectral clustering [32] for both the balanced and unbalanced clusterings described before. For the spectral 
clustering method, we first find the largest $k = 4$ principal components of $A$ and then run SLINK on the 
principal components. Fig. 6 shows the result, indicating that max-norm, even when the noise level is high 
and no method can recover the exact clustering, outputs a clustering that is not far from the true underlying 
clustering in our metric. 
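For reference, the variation of information between two clusterings of the same $n$ nodes is $H(X \mid Y) + H(Y \mid X)$, where $X$ and $Y$ are the cluster labels of a uniformly random node. A minimal sketch, using natural logarithms (so the result is in nats) and a function name of our choosing, is:

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """Variation of information (in nats) between two clusterings given as
    label lists over the same nodes. It is zero iff the clusterings agree
    up to renaming of the labels."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # Contribution of this cell to H(X|Y) + H(Y|X).
        vi -= p_ab * (math.log(p_ab / (count_a[a] / n))
                      + math.log(p_ab / (count_b[b] / n)))
    return vi


same = variation_of_information([0, 0, 1, 1], [5, 5, 7, 7])
split = variation_of_information([0, 0, 1, 1], [0, 0, 0, 0])
```

Here `same` is zero because the two labelings describe the same partition, and `split` equals $\log 2$, the entropy lost by merging two equal-size clusters.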



5.3 MNIST Dataset 

To demonstrate our method on a realistic and larger scale data set, we run our enhanced algorithm, trace-norm 
and spectral clustering on the MNIST dataset [19]. For each experiment, we pick a total of $n$ data points from 
10 different classes ($n/10$ from each class) and construct the affinities using a Gaussian kernel as explained in 
[6]. We report the time complexities and clustering errors, as in the previous experiment, in Fig. 7. For the spectral 
clustering, we take the SVD using Matlab and pick the top 10 principal components followed by k-means. 
[Figure 8 panels: (a) Original Clustering, (b) Alternative Clustering.] 

Figure 8: Illustration of two alternative clusterings on the same graph with $D_{\max} = \gamma$. Each gray cloud of 
points is a clique. Each link between two clouds of points connects every point in one cloud to every point 
in the other cloud. 

A Proof of Lemma 2 

Provided equivalences (1) and (2), it is clear that $\{K = LR^T : \|L\|_{\infty,2} \le 1, \|R\|_{\infty,2} \le 1\}$ and 
$\{K = RR^T : \|R\|_{\infty,2} \le 1\}$ are both convex sets. Since $\{K = RR^T : \|R\|_{\infty,2} \le 1, R \ge 0\}$ is the 
intersection of the two sets $\{K = RR^T : \|R\|_{\infty,2} \le 1\}$ and $\mathcal{CP} \triangleq \{K = RR^T : R \ge 0\}$, it suffices 
to show that $\mathcal{CP}$ is a convex set. The set $\mathcal{CP}$ is called the set of completely positive matrices and has been 
shown to be a closed convex cone (see Theorem 2.2 in [5] for details). 

For the proof of equivalence (1), see Lemma 15 in [29]. To prove equivalence (2), it is clear that 
$\{K = RR^T : \|R\|_{\infty,2} \le 1\} \subseteq \{K : \|K\|_{\max} \le 1, K \succeq 0\}$. Now, suppose $K_0 \in \{K : \|K\|_{\max} \le 1, K \succeq 0\}$; let 
$R_0 = \sqrt{K_0}$ and, for the sake of contradiction, assume that $\|R_0\|_{\infty,2} > 1$. Since the diagonal entries of 
$K_0 = R_0 R_0^T$ are the squared row norms of $R_0$, this implies that at least one element on the diagonal 
of $K_0$ exceeds 1 and hence $\|K_0\|_{\max} > 1$. This is a contradiction and hence equivalence (2) follows. 
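
The step used in this argument, that the largest diagonal entry of $RR^T$ equals the square of the largest Euclidean row norm of $R$, can be checked numerically. The sketch below uses a small hypothetical matrix `R` chosen only for illustration; the helper names are likewise assumptions.

```python
import math


def gram(R):
    """Return K = R R^T for a matrix given as a list of rows."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in R] for ri in R]


def norm_inf_2(R):
    """Largest Euclidean row norm of R (the (infinity, 2) norm)."""
    return max(math.sqrt(sum(x * x for x in row)) for row in R)


# Hypothetical example: the second row has norm larger than 1.
R = [[0.6, 0.8], [1.0, 0.5], [0.0, 0.3]]
K = gram(R)
max_diag = max(K[i][i] for i in range(len(K)))
```

Here `max_diag` coincides with `norm_inf_2(R) ** 2`, so a row norm above 1 forces a diagonal entry of K above 1.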

To show relation (3), it suffices to show that the subset relation is strict, since the subset relation 
itself is trivial. By the counter-example provided in [14], the subset relation is strict (i.e., there is a positive 
semi-definite matrix $K_0$ with positive entries that does not belong to $\mathcal{CP}$). 

B Proof of Lemma [T] 

We construct an example with D max = ^ that cannot be recovered. Consider the clustering shown 



in Fig. 8(a). It is clear that for this clustering, we have $D_{\max} = \gamma$ and 



B(C 1 )= 1 2 J2\Ci\ 2 + \Y,\ C ^ n -\ C ^- 
Now, consider the alternative clustering shown in Fig. 8(b). For this alternative clustering, we have 

$$D(\mathcal{C}_2) = \gamma (1 - 2\gamma) \sum_{i=1}^{k} |C_i|^2.$$

It is clear that B(C,2) < B(C\) (the alternative is a better clustering) for 7 > ^ • 



C Proof of Theorem [T] 

The proof has two main steps: in the first step, we characterize a set of sufficient optimality conditions based on 
the existence of a dual variable, and in the second step, we construct such a dual variable. For the sake of the 
proof, we consider a useful equivalent definition [21] of the max-norm as 

$$\|K\|_{\max} = \max_{X : \|X\|_2 \le 1} \|K \circ X\|_2, \qquad (8)$$

where $\|\cdot\|_2$ is the spectral norm (maximum singular value) of the matrix and "$\circ$" is the Hadamard 
(element-wise) product. 

C.1 Notation 

In this section, we introduce our notation and definitions used throughout the paper. 

C.1.1 Residual Matrix Notations 

In general, we do not expect the residual matrix $B^* = A - K^*$ to be sparse unless we threshold the affinity 
matrix (or we have an adjacency matrix). However, to provide a guarantee, we need to characterize the 
sub-gradient of the $\ell_1$-norm and hence distinguish between the zeros and non-zeros of $B^*$. Let 

$$\Omega = \{B \in \mathbb{R}^{n \times n} : B = B^T, \ \mathrm{Supp}(B) \subseteq \mathrm{Supp}(B^*)\}, \qquad (9)$$

where $\mathrm{Supp}(\cdot)$ is the index set of non-zero entries. The orthogonal projection of a matrix M onto this space 
is defined to be the matrix of the same size with $P_\Omega(M)_{ij} = M_{ij}$ if $(i,j) \in \mathrm{Supp}(B^*)$ and zero otherwise. The 
orthogonal complement of this space is denoted by $\Omega^\perp$ and the projection is defined as $P_{\Omega^\perp}(M) = M - P_\Omega(M)$. 
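
The entry-wise projection onto the support of the residual, and its complement, can be sketched as follows. Matrices are represented as lists of rows and the support as a set of index pairs; these representations and the function names are assumptions made for this example.

```python
def support_of(B, tol=0.0):
    """Index set of entries of B whose magnitude exceeds tol."""
    return {(i, j) for i, row in enumerate(B) for j, b in enumerate(row) if abs(b) > tol}


def project_omega(M, support):
    """Keep the entries of M indexed by the support and zero out the rest."""
    return [[m if (i, j) in support else 0.0 for j, m in enumerate(row)]
            for i, row in enumerate(M)]


def project_omega_perp(M, support):
    """Complementary projection: M minus its projection onto the support."""
    P = project_omega(M, support)
    return [[m - p for m, p in zip(row_m, row_p)] for row_m, row_p in zip(M, P)]
```

By construction the two projections sum to M, and applying either one twice gives the same result as applying it once.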

C.1.2 Clustering Matrix Notations 

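Let $U \in \mathbb{R}^{n \times k^*}$ be constructed as 

$$U = \begin{bmatrix} \frac{1}{\sqrt{|C_1|}} \mathbf{1}_{|C_1|} & & & \\ & \frac{1}{\sqrt{|C_2|}} \mathbf{1}_{|C_2|} & & \\ & & \ddots & \\ & & & \frac{1}{\sqrt{|C_{k^*}|}} \mathbf{1}_{|C_{k^*}|} \end{bmatrix}. \qquad (10)$$

Define $T = \{UX^T + YU^T : X, Y \in \mathbb{R}^{n \times k^*}\}$ to be the space of matrices sharing either row or column space 
with U. The orthogonal projection to this space can be defined as 

$$P_T(M) = UU^T M + M UU^T - UU^T M UU^T,$$

where 

$$UU^T = \begin{bmatrix} \frac{1}{|C_1|} \mathbf{1}_{|C_1| \times |C_1|} & & & \\ & \frac{1}{|C_2|} \mathbf{1}_{|C_2| \times |C_2|} & & \\ & & \ddots & \\ & & & \frac{1}{|C_{k^*}|} \mathbf{1}_{|C_{k^*}| \times |C_{k^*}|} \end{bmatrix}.$$

Denote the orthogonal complement of the space T by $T^\perp$, equipped with the projection $P_{T^\perp}(M) = M - P_T(M)$. 
Let $\alpha = 2D_{\max}$ be the contraction between the ideal clusters and disagreements (see Lemma 5 for more 
details on this definition). Under the assumption of the theorem, we have $\alpha < 1$ and hence $T \cap \Omega = \{0\}$. 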
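
The construction of U and the projection onto T can be sketched numerically as below. The sketch assumes that nodes are ordered so that each cluster occupies a contiguous block of indices, and that U has normalized cluster-indicator columns; the function names and the example cluster sizes are hypothetical.

```python
import math


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def build_U(cluster_sizes):
    """Normalized cluster-indicator matrix with one column per cluster."""
    n, k = sum(cluster_sizes), len(cluster_sizes)
    U = [[0.0] * k for _ in range(n)]
    row = 0
    for c, size in enumerate(cluster_sizes):
        for _ in range(size):
            U[row][c] = 1.0 / math.sqrt(size)
            row += 1
    return U


def project_T(M, U):
    """P_T(M) = UU^T M + M UU^T - UU^T M UU^T."""
    P = matmul(U, transpose(U))
    PM, MP = matmul(P, M), matmul(M, P)
    PMP = matmul(PM, P)
    n = len(M)
    return [[PM[i][j] + MP[i][j] - PMP[i][j] for j in range(n)] for i in range(n)]


# Hypothetical example with two clusters of sizes 2 and 3.
U = build_U([2, 3])
UUt = matmul(U, transpose(U))
```

In this example `UUt` is block diagonal with constant blocks 1/2 and 1/3, it is left unchanged by `project_T`, and applying `project_T` twice to any matrix agrees with applying it once up to rounding.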



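Using the definitions in (11), let 

$$X^* = \mathcal{W}(Z^*) + \mathcal{V}(UU^T),$$

where 

$$Z^* = \begin{bmatrix} \frac{1}{|C_1|} \mathbf{1}_{|C_1| \times |C_1|} & \frac{1}{\sqrt{|C_1||C_2|}} \mathbf{1}_{|C_1| \times |C_2|} & \cdots & \frac{1}{\sqrt{|C_1||C_{k^*}|}} \mathbf{1}_{|C_1| \times |C_{k^*}|} \\ \frac{1}{\sqrt{|C_2||C_1|}} \mathbf{1}_{|C_2| \times |C_1|} & \frac{1}{|C_2|} \mathbf{1}_{|C_2| \times |C_2|} & \cdots & \frac{1}{\sqrt{|C_2||C_{k^*}|}} \mathbf{1}_{|C_2| \times |C_{k^*}|} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{\sqrt{|C_{k^*}||C_1|}} \mathbf{1}_{|C_{k^*}| \times |C_1|} & \frac{1}{\sqrt{|C_{k^*}||C_2|}} \mathbf{1}_{|C_{k^*}| \times |C_2|} & \cdots & \frac{1}{|C_{k^*}|} \mathbf{1}_{|C_{k^*}| \times |C_{k^*}|} \end{bmatrix}.$$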
Notice that $P_T(X^*) = UU^T$ and hence $X^* - UU^T \in T^\perp$. If we show that $X^* - UU^T$ has spectral norm 
less than 1, then it is immediate that $X^* \in \arg\max_{X : \|X\|_2 \le 1} \|K^* \circ X\|_2$. Also, we have an eigenvalue 
decomposition $K^* \circ X^* = [U \ V] S [U \ V]^T$, where U is as defined above and contains the eigenvector(s) 
corresponding to the maximum magnitude eigenvalue +1 (with $k^*$ repetitions). To bound the spectral norm 
of $X^* - UU^T$, consider 

$$\|X^* - UU^T\|_2 = \|\mathcal{W}(P_\Omega(Z^* - UU^T))\|_2 \le \frac{D_{\max}(k^* - 1)}{1 - \alpha} < 1.$$

The first inequality follows from Lemma 6. We make assumptions so that the last inequality holds. 

We use the variational form (8) to characterize the sub-gradient of the max-norm at the point $K^*$. 

Lemma 3. For a matrix $M \in \mathbb{R}^{n \times n}$, we have $M \in \partial \|K^*\|_{\max}$ if $M = (USU^T + W) \circ X^*$, for some diagonal 
positive semi-definite matrix $S \in \mathbb{R}^{r \times r}$ with Trace(S) = 1 and for some matrix $W \in \mathbb{R}^{n \times n}$ with $P_T(W) = 0$ 
and $\|W\|_* \le 1$. 

Proof. Using the variational form (8) and Theorem 4.4.2 in [17] on the sub-gradient of the maximum of convex 
functions, we have 

$$\partial \|K^* \circ X^*\|_2 \subseteq \partial \|K^*\|_{\max}.$$

Thus, it suffices to show that $M \in \partial \|K^* \circ X^*\|_2$ (which is the case). 

□ 



C.2 Sufficient Optimality Conditions 

We provide optimality conditions similar to those given in the literature for $\ell_1$ plus trace-norm minimization. 
The main difference here is the presence of the auxiliary variable $X^*$ in the conditions. The following 
lemma characterizes a set of sufficient optimality conditions. 

Lemma 4 (Sufficient Optimality Condition.). K* = (Problem Q = Problem if T H 9, = {0} and 

there exists a dual matrix Q such that 

(a) Vq(Q o X*) = -^#Sign(A - K*) 

(b) ||^(QoX*)|| oo <^# 

(c) $P_T(Q) = USU^T$, for some diagonal matrix $S \succeq 0$ with Trace(S) = $\mu$. 

(d) \\V T ±{Q)\l<H- 



Proof. Notice that since $X^*$ by construction has no zero entry (except for the corner case where there 
are only two clusters, both of size 2), the matrix $Q \circ X^*$ can take any value/sign on each entry by choosing 
the values of Q properly. Under these conditions, $Q \circ X^* \in \partial \|A - K^*\|_1$ and also $Q \circ X^* \in \partial \|K^*\|_{\max}$, and 
the result follows from the standard first-order optimality argument and the zero duality gap of both the $\ell_1$ and 
max norms. 

□ 



C.3 Dual Variable Construction 

First notice that under the assumption of the theorem, we have $\alpha < 1$ and hence, by Lemma 5, we have 
$T \cap \Omega = \{0\}$ and also $\mu = \mu_0$ is feasible. Second, we construct Q by using alternating projections. Consider 
the infinite sums 

$$\begin{aligned} \mathcal{W}(M) &= M - P_T(M) + P_\Omega(P_T(M)) - P_T(P_\Omega(P_T(M))) + \ldots \\ \mathcal{V}(N) &= N - P_\Omega(N) + P_T(P_\Omega(N)) - P_\Omega(P_T(P_\Omega(N))) + \ldots \end{aligned} \qquad (11)$$

By the proof of Lemma 5, these sums converge geometrically with parameter $\alpha$ (see Lemma 5 in [16] for 
the proof). Denoting element-wise division by "/" (with the convention $0/0 = 0$), let 



^2 



W(Sign(A - K*)/X*) + j^ViUU 1 ) 
It is easy to check that conditions (a) and (c) in Lemma 4 are both satisfied for $S = \frac{\mu}{k^*} I$. To show condition 
(b), first notice that \\Vn±Vq-'Pn(M)\\ 00 < D max \\Vn±(M)\\ 00 and hence, we have 



||7V(Q°**)l|oo< 



1 



(1 - Anax) 2 
1 

max 



n 



1 — fl 



i (1 - Anax) 2 



n 

(i-m)IQI 



V T (Sign(A - K*)/X*) + j^UU T ) o TV (UU T - V T {Z*)) 



1 



(i - Z) max ) s 



(i-m) 



(1 + D max )D n 



k*\Ci\) \d\ 

[l 1 + Dmi 
k* |C m i n |^ 



< 



The last inequality holds for > ^i-sd~ I ^ I )\c ■ \ 2 • ^ OT tne condition (d), we have 



(l + Anax) 



\W r ±(Q)\U < Y- 



V rl _ 



1-a 



' JL^ 1-M IClT 

^fc* |Ci| n n , 

i-/x Vic 2 | jgTT - 



^--^Sign(A - K*)/X* + ^-Vn(UU T ) 
i-/x VlCil |C 2 | 



L |Ci|x|d| 
L |C 2 |x|d| 



n n 



L |Ci|x|C 2 | 



Ji_^ 1-M 1^2 I ^ -i 

k* \C 2 \ n n J 1 \C 2 \x\C 2 \ 



1 — a n 



l-M VlCfc*! |Ci| 

n 1 \C k *\x\C 1 \ 

ICiN 

lldlxldl 

viggTgn -i 

n 1 \C 2 \x\C 1 \ 



^ n 1 |C fc *|x|C 2 | 



'^T" n 1 |Ci|x|C fc ,| 

n 1 |C 2 |x|C fc *| 



fc* |C fc * | n n J L \Ck* \x\C k * I 



V\Cl\ jggj , 
n -MCilxICal 

|C 2 |-, 

^ 1 |C 2 |x|C 2 | 



V|Cfc*l|Ci| . 



L|C fc *|x|Ci| 



V|Cfc*l ggT - 



L|C fc * |x|C 2 | 



i^tt^I 







k^JC 2 ~\ ± \C 2 \x\C 2 \ 




1 — a V n 



+ AH < A 4 - 



n l|Ci|x|C fc *| 

^ l|C 2 |x|C fc *| 

-^l|C fc *|x|C fe *| 





The last inequality holds for 



(l-/i)fc* (l-a;-£> max )fc* 
/m 2 ^ D max Ei|Ci| 2 



as assumed. 



Lemma 5. If $\alpha < 1$ then $T \cap \Omega = \{0\}$. 

Proof. We show that the projection $P_T P_\Omega(\cdot)$ has a norm $\alpha$ strictly less than one. Then, if there exists a 
non-zero matrix $M \in T \cap \Omega$, then $\|M\|_\infty = \|P_T P_\Omega(M)\|_\infty \le \alpha \|M\|_\infty < \|M\|_\infty$ is a trivial contradiction. 

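Let $M \in \Omega$ and consider 

$$\|P_T(M)\|_\infty = \max_{i,j} \left\| \frac{1}{|C_i|} \mathbf{1}_{|C_i| \times |C_i|} M_{C_i,C_j} + M_{C_i,C_j} \frac{1}{|C_j|} \mathbf{1}_{|C_j| \times |C_j|} - \frac{1}{|C_i||C_j|} \mathbf{1}_{|C_i| \times |C_i|} M_{C_i,C_j} \mathbf{1}_{|C_j| \times |C_j|} \right\|_\infty \le 2 D_{\max} \|M\|_\infty = \alpha \|M\|_\infty.$$

The last step is attained by optimizing over $|C_i|$ and $|C_j|$. This concludes the proof of the lemma. 

□ 

Lemma 6. $\|\mathcal{W}(P_\Omega(Z^* - UU^T))\|_2 \le \frac{D_{\max}}{1 - \alpha} (k^* - 1)$. 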



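Proof. For $M \in \Omega$, we have $\|M\|_2 \le \|M_a\|_2$, where $M_a \in \mathbb{R}^{k^* \times k^*}$ with $(M_a)_{ij} = \|M_{C_i,C_j}\|_2$. By definition 
of $D_{\max}$, we have $\|M_{C_i,C_j}\|_2 \le D_{\max} \sqrt{|C_i||C_j|} \, \|M_{C_i,C_j}\|_\infty$. Thus, 

$$\|P_\Omega(Z^* - UU^T)\|_2 \le D_{\max} \left\| \begin{bmatrix} 0 & 1 & \cdots & 1 \\ 1 & 0 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 0 \end{bmatrix} \right\|_2 = D_{\max} (k^* - 1).$$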
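
The value used for the spectral norm of the k-by-k matrix with zero diagonal and unit off-diagonal entries can be sanity-checked with a short sketch. It compares a lower bound (the Rayleigh quotient at the all-ones vector) with an upper bound (the maximum absolute row sum, which bounds the spectral norm of any symmetric matrix). The helper names and the choice k = 4 are assumptions for illustration.

```python
def ones_minus_identity(k):
    """k-by-k matrix with zeros on the diagonal and ones elsewhere."""
    return [[0.0 if i == j else 1.0 for j in range(k)] for i in range(k)]


def rayleigh_quotient(A, x):
    """x^T A x / x^T x, a lower bound on the largest eigenvalue of symmetric A."""
    Ax = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    return sum(xi * yi for xi, yi in zip(x, Ax)) / sum(xi * xi for xi in x)


def max_abs_row_sum(A):
    """Induced infinity norm; an upper bound on the spectral norm when A is symmetric."""
    return max(sum(abs(a) for a in row) for row in A)


k = 4  # hypothetical number of clusters
A = ones_minus_identity(k)
lower = rayleigh_quotient(A, [1.0] * k)
upper = max_abs_row_sum(A)
```

Both bounds equal k - 1, so the spectral norm of this matrix is exactly k - 1.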
The rest of the proof is straightforward, as follows: 



\\W{Tn{Z* -UU T ))\\ 2 



< 



v r^ Q J2(V T Vny(Vn(Z* - UU T )) 

\i=0 

/ oo 

V Q Y,( V T^y(Vn(Z* - UU T )) 



\t=0 



1 — a 



(k* - 1). 



This concludes the proof of the lemma. 